ABSTRACT

Entity resolution is central to data integration and data 
cleaning. Algorithmic approaches have been improving in 
quality, but remain far from perfect. Crowdsourcing plat- 
forms offer a more accurate but expensive (and slow) way 
to bring human insight into the process. Previous work 
has proposed batching verification tasks for presentation to 
human workers but even with batching, a human-only ap- 
proach is infeasible for data sets of even moderate size, due 
to the large numbers of matches to be tested. Instead, we 
propose a hybrid human-machine approach in which ma- 
chines are used to do an initial, coarse pass over all the data, 
and people are used to verify only the most likely matching 
pairs. We show that for such a hybrid system, generating the 
minimum number of verification tasks of a given size is NP- 
Hard, but we develop a novel two-tiered heuristic approach 
for creating batched tasks. We describe this method, and 
present the results of extensive experiments on real data sets 
using a popular crowdsourcing platform. The experiments 
show that our hybrid approach achieves both good efficiency 
and high accuracy compared to machine-only or human-only 
alternatives. 

1. INTRODUCTION 

Entity resolution (also known as entity reconciliation, 
duplicate detection, record linkage and merge/purge) in 
database systems is the task of finding different records that 
refer to the same entity. Entity resolution is particularly im- 
portant when cleaning data or when integrating data from 
multiple sources. In such scenarios, it is not uncommon for 
records that are not exactly identical to refer to the same 
real-world entity. For example, consider the table of product 
data shown in Table 1. Records r1 and r2 in the table have
different text in the Product Name field, but refer to the 
same product. Our goal is to find all such duplicate records. 

Table 1: A table of products.

ID   Product Name                             Price
r1   iPad Two 16GB WiFi White                 $490
r2   iPad 2nd generation 16GB WiFi White      $469
r3   iPhone 4th generation White 16GB         $545
r4   Apple iPhone 4 16GB White                $520
r5   Apple iPhone 3rd generation Black 16GB   $375
r6   iPhone 4 32GB White                      $599
r7   Apple iPad2 16GB WiFi White              $499
r8   Apple iPod shuffle 2GB Blue              $49
r9   Apple iPod shuffle USB Cable             $19

There has been significant work in developing automated
algorithms for entity resolution (see [11] for a recent survey).
A basic machine-based technique is to compute a pre-defined
similarity metric, such as Jaccard similarity, for each pair of
records [2,5,26]. Records whose similarity values are above
a specified threshold are considered to refer to the same
entity. More sophisticated techniques use machine learning.
For example, some approaches model entity resolution as a
classification problem, training a classifier to distinguish
between "duplicate" and "non-duplicate" pairs [4,6]. Despite all
of this progress, machine-based techniques remain far from
perfect. For example, a recent study [18] describes the
difficulty that state-of-the-art techniques have in many domains,
such as identifying duplicate products based on their textual
descriptions.

The limitations of machine-based approaches combined 
with the availability of easily-accessible crowdsourcing plat- 
forms have caused many to turn to human-based approaches. 
Indeed, de-duplication (of addresses, names, product de- 
scriptions, etc.) is an important use case for popular crowd- 
sourcing platforms such as Amazon Mechanical Turk (AMT) 
and Crowdflower. Such platforms support crowdsourced execution
of "microtasks" or Human Intelligence Tasks (HITs),
where people do simple jobs requiring little or no domain 
expertise, and get paid on a per-job basis. Entity resolu- 
tion is easily expressed as a query in a crowd-enabled query 
processing system such as CrowdDB [13] or Qurk [20]. For 
example, in CrowdDB's CrowdSQL language, the following 
self-join query identifies duplicate product records. 

    SELECT p.id, q.id FROM product p, product q
    WHERE p.product_name ~= q.product_name;

Note that ~= is an operator that can ask the crowd to decide
whether or not p.product_name and q.product_name refer
to the same product.

A naive way to process such a query is to create a HIT for 
each pair of records, and for each HIT, to ask people in the 
crowd to decide whether or not the two records both refer 
to the same entity. For a table with n records, the naive 
execution method will lead to O(n^2) HITs. In a recently
published paper [19], Marcus et al. proposed two batch- 
ing strategies to reduce the number of HITs for matching 
operations (they focused on join operations). The first is 
to simply place multiple record pairs into a single HIT. To 
perform such a pair-based HIT, a worker must check each 
pair in the HIT individually. Suppose a pair-based HIT 
consists of k pairs of records. The number of HITs will be
reduced to O(n^2/k). Their second batching approach also
placed multiple pairs of records in a single HIT but they
asked workers to find all matches among all of the records.
In this latter approach, if a HIT consists of k records the
number of HITs will be reduced to O(n^2/k^2). Their results
indicated that batching could provide significant benefits in 
cost with only minimal negative impact on accuracy. How- 
ever, their approach suffers from a scalability problem. Even 
with a modest database size of 10,000 records, assuming a 
reasonable HIT size k = 20, their approaches would require
5,000,000 and 250,000 HITs respectively. At even $0.01 per 
HIT, this query would cost $50,000 or $2,500 to execute, if 
in fact, it could even be successfully executed on existing 
platforms. 
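
As a quick check of the arithmetic above, the following sketch recomputes the HIT counts and costs from the estimates n^2/k and n^2/k^2, treating the asymptotic bounds as exact counts in the same way the text does. The function names and the use of integer cents are illustrative choices, not part of the original work.

```python
def pair_batched_hits(n: int, k: int) -> int:
    """HITs needed when each HIT holds k record pairs (estimate n^2 / k)."""
    return n * n // k


def record_batched_hits(n: int, k: int) -> int:
    """HITs needed when each HIT holds k records (estimate n^2 / k^2)."""
    return n * n // (k * k)


def cost_in_dollars(hits: int, cents_per_hit: int = 1) -> int:
    """Total cost in whole dollars, computed in integer cents to avoid float error."""
    return hits * cents_per_hit // 100


n, k = 10_000, 20
for label, hits in [("pair-based", pair_batched_hits(n, k)),
                    ("record-based", record_batched_hits(n, k))]:
    print(f"{label}: {hits:,} HITs, ${cost_in_dollars(hits):,}")
```

With n = 10,000 and k = 20 this reproduces the 5,000,000 and 250,000 HITs and the $50,000 and $2,500 totals quoted in the text.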

While the insight that batching can reduce the number 
of HITs is a correct one, clearly batching on its own is not 
sufficient to enable Entity Resolution to be done at scale. 
Instead, what is needed is a hybrid human-machine ap- 
proach, that uses machine-based techniques to weed out ob- 
vious non-duplicates, while using precious human resources 
to examine just those cases where human insight is needed. 
Similar hybrid approaches have been shown to be effective 
for problems such as image search [27] and language trans- 
lation [23]. Following this line of research, we propose a 
hybrid human-machine Entity Resolution approach called 
CrowdER. CrowdER first uses machine-based techniques to 
discard those pairs of records that look very dissimilar, and 
only asks the crowd to verify the remaining pairs. 

Having machine-based similarity estimates raises a further 
opportunity. Namely, the similarity estimates can be used to 
further optimize the grouping of specific records into HITs. 
Similar to Marcus et al., we consider batching of records into
HITs as groups of individual pairs ("pair-based HIT generation")
or in a format that requires workers to find matches
among all the records ("cluster-based HIT generation") but
with the major difference that we use machine-generated 
similarity estimates to drive the creation of the HITs. We 
formulate the HIT generation problem and show that gen- 
erating the minimum number of cluster-based HITs is NP- 
Hard. Thus, we develop a heuristic-based algorithm for gen- 
erating cluster-based HITs and evaluate the algorithm an- 
alytically and through an extensive experimental analysis 
using real data sets and a popular crowdsourcing platform. 

The main contributions are the following: 

• We propose a hybrid human-machine system for entity 
resolution by combining human-based techniques and 
machine-based techniques. 

• We formulate the cluster-based HIT generation prob- 
lem, prove that it is an NP-Hard problem and develop 
a two-tiered heuristics-based solution for it. 

• We compare pair-based HIT generation and cluster- 
based HIT generation analytically and experimentally. 

• We have implemented our approaches and compared
them with the state-of-the-art techniques on real
datasets using the AMT platform. Experimental results
show that our approach reduces cost while providing
good answer quality. In particular, our hybrid
human-machine approach makes it practical to bring
humans into the Entity Resolution process.

The remainder of this paper is organized as follows. We 
propose a hybrid human-machine approach for entity reso- 
lution in Section 2. Section 3 introduces the HIT generation 
problem and proves that it is NP-Hard. We present an ap- 
proximation algorithm in Section 4 and devise a more prac- 
tical, two-tiered approach in Section 5. Section 6 compares 
pair-based HIT and cluster-based HIT generation analyti- 
cally. We describe the results of our experimental studies 
in Section 7. Related work is reviewed in Section 8 and we 
present conclusions and future work in Section 9. 

2. ENTITY RESOLUTION TECHNIQUES 

In this section, we first review existing machine-based 
techniques for entity resolution and then describe a hybrid 
workflow combining people and machines. 

2.1 Machine-based Techniques 

Machine-based Entity Resolution techniques can be 
broadly divided into two categories, similarity-based and 
learning-based. 

2.1.1 Similarity-based

Similarity-based techniques require a similarity function 
and a threshold. The similarity function takes a pair of 
records as input, and outputs a similarity value. The more 
similar the two records, the higher the output value. The 
basic approach is to compute the similarity of all pairs of 
records. If a pair of records has a similarity value no smaller 
than the specified threshold, then they are considered to 
refer to the same entity. 

For example, in Table 1, suppose that the similarity of
two records is specified as the Jaccard similarity between their
Product Names, and the specified threshold is 0.5. Jaccard
similarity over two sets is defined as the size of the set
intersection divided by the size of the set union. For example,
the Jaccard similarity between the Product Names of r1 and
r2 is

\[
J(r_1, r_2) = \frac{|\{\text{iPad, 16GB, WiFi, White}\}|}{|\{\text{iPad, 16GB, WiFi, White, Two, 2nd, generation}\}|} = \frac{4}{7} \approx 0.57.
\]

The similarity-based technique will consider (r1, r2) as
referring to the same entity since their Jaccard similarity is
no smaller than the threshold, i.e., J(r1, r2) ≥ 0.5. Similarly,
(r1, r3) will not be considered a match since J(r1, r3) =
0.25 < 0.5.
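
The worked example can be reproduced with a few lines of Python. This is a minimal sketch that assumes product names are split on whitespace and compared case-sensitively, which is what the token sets in the example suggest; the function names are illustrative.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-separated token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty names
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Similarity-based decision: a match if similarity is no smaller than the threshold."""
    return jaccard(a, b) >= threshold


r1 = "iPad Two 16GB WiFi White"
r2 = "iPad 2nd generation 16GB WiFi White"
r3 = "iPhone 4th generation White 16GB"

print(round(jaccard(r1, r2), 2), is_match(r1, r2))  # 0.57 True
print(jaccard(r1, r3), is_match(r1, r3))            # 0.25 False
```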

Since it is expensive to compute the similarity for every 
pair of records, research on similarity-based techniques [2, 
5,26] mainly focuses on how to reduce the number of pairs 
evaluated. 

2.1.2 Learning-based

Learning-based techniques model entity resolution as a
classification problem [4,6]. They represent a pair of records
as a feature vector in which each dimension is a similarity
value of the records on some attribute. If we choose n
similarity functions on m attributes, then the feature vector
will be an nm-dimensional feature vector. For example,
for the records in Table 1, suppose we only choose Jaccard
similarity on Product Name. Then each pair of records will
be represented as a feature vector that contains only a single
dimension. Learning-based techniques require a training
set to train the classifier. The training set consists of positive
feature vectors and negative feature vectors indicating
matching pairs and non-matching pairs respectively. The
trained classifier can then be applied to label new record
pairs as matching or non-matching.

Figure 1: Hybrid human-machine workflow.
[A table of records is processed by machine-based techniques;
pairs of records whose likelihoods exceed a specified threshold
go to HIT generation; the resulting Human Intelligence Tasks
(HITs) are answered by a crowd of humans; the output is the
pairs of records that refer to the same entity.]

2.2 Hybrid Human-Machine Workflow 

As described in the introduction, people are often better 
than algorithms at detecting when different terms actually 
refer to the same entity. However, compared to algorithmic 
techniques people are much slower and more expensive. A 
hybrid human-machine approach has the potential to com- 
bine the efficiency of machine-based approaches with the 
answer quality that can be obtained from people. The intuitive
idea is that among the n(n-1)/2 pairs of records that can be
generated from a set of n records, a large number of pairs
are very dissimilar. Such pairs can be easily pruned using a 
machine-based technique. People can then be brought in to 
examine the remaining pairs. 

Based on this idea, we propose a human-machine workflow 
as shown in Figure 1. The workflow first uses machine-based 
techniques to compute for each pair the likelihood that they 
refer to the same entity (see footnote 1). For example, the likelihood could
be the similarity value given by a similarity-based technique. 
Then, only those pairs whose likelihood exceeds a specified 
threshold are sent to the crowd. In the experimental sec- 
tion, we show that by specifying a relatively low threshold 
we can dramatically reduce the number of pairs that need 
to be verified with only a minor loss of quality. Given the 
set of pairs to be sent to the crowd, the next step is to gen- 
erate HITs so that people can check them for matches. HIT 
Generation is a key component of our workflow. Finally, 
generated HITs are sent to the crowd for processing and the 
answers are collected. 

Footnote 1: In practice, we can adopt some indexing techniques
such as blocking and Q-gram based indexing [7] to avoid
all-pairs comparison.

Example 1. Consider the nine records in Table 1. Instead
of asking people to check (9 × 8)/2 = 36 pairs, our workflow
first employs a machine-based technique to compute the
likelihood of each pair of records. Here we use the Jaccard
similarity between Product Names of a pair of records as their
likelihood. Then the workflow prunes the pairs whose
likelihood is smaller than the specified threshold. Suppose the
threshold is 0.3. Figure 2(a) shows the remaining ten pairs
of records. That is, the workflow only needs to generate HITs
to verify the ten pairs (rather than 36). In this example, we
batch two pairs into each HIT, and generate five HITs as
shown in Figure 2(b). As an example, in the first HIT, the
crowd selects "YES" for (r1, r2) and "NO" for (r4, r6), which
indicates that r1 and r2 are the same entity while r4 and
r6 are not. After the crowd finishes all the HITs, the four
matching pairs (as determined by the crowd) are returned
(Figure 2(c)).

Figure 2: An example of using the hybrid human-machine
workflow to find duplicate pairs in Table 1.

(a) Remove the pairs whose likelihood < 0.3. Remaining pairs,
    with their likelihoods:
    (r1, r2, 0.57)   (r4, r6, 0.50)
    (r1, r7, 0.43)   (r3, r4, 0.43)
    (r4, r7, 0.43)   (r8, r9, 0.43)
    (r2, r3, 0.38)   (r2, r7, 0.38)
    (r3, r5, 0.38)   (r4, r5, 0.38)
    Examples of removed pairs: (r3, r6, 0.29), (r1, r3, 0.25).

(b) Generate HITs to verify the pairs of records:
    HIT 1: (r1, r2) YES   (r4, r6) NO
    HIT 2: (r1, r7) YES   (r3, r4) YES
    HIT 3: (r4, r7) NO    (r8, r9) NO
    HIT 4: (r2, r3) NO    (r2, r7) YES
    HIT 5: (r3, r5) NO    (r4, r5) NO

(c) Output matching pairs:
    (r1, r2), (r1, r7), (r3, r4), (r2, r7)
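
The pruning step of Example 1 can be sketched as follows. The sketch assumes whitespace tokenization of the Product Name field and uses illustrative names (`PRODUCTS`, `candidate_pairs`); it compares all pairs directly, whereas a real system would use an index as noted in footnote 1.

```python
from itertools import combinations

# Product names from Table 1, keyed by record ID.
PRODUCTS = {
    "r1": "iPad Two 16GB WiFi White",
    "r2": "iPad 2nd generation 16GB WiFi White",
    "r3": "iPhone 4th generation White 16GB",
    "r4": "Apple iPhone 4 16GB White",
    "r5": "Apple iPhone 3rd generation Black 16GB",
    "r6": "iPhone 4 32GB White",
    "r7": "Apple iPad2 16GB WiFi White",
    "r8": "Apple iPod shuffle 2GB Blue",
    "r9": "Apple iPod shuffle USB Cable",
}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between whitespace-separated token sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0


def candidate_pairs(records, threshold):
    """Return (id_a, id_b, likelihood) for pairs at or above the threshold, most likely first."""
    scored = [(a, b, jaccard(records[a], records[b]))
              for a, b in combinations(sorted(records), 2)]
    kept = [p for p in scored if p[2] >= threshold]
    return sorted(kept, key=lambda p: -p[2])


pairs = candidate_pairs(PRODUCTS, threshold=0.3)
for a, b, likelihood in pairs:
    print(a, b, round(likelihood, 2))
```

Of the 36 possible pairs, ten survive the 0.3 threshold, and only those would be sent to the crowd.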

3. HIT GENERATION 

Recall that a key step in the human-machine workflow 
is HIT Generation. That is, given a set of pairs of records, 
they must be combined into HITs for crowdsourcing. In this 
section we discuss the generation problem for two types of 
HITs, pair-based HITs and cluster-based HITs. 

3.1 Pair-based HIT Generation 

A pair-based HIT consists of multiple pairs of records to
be compared, batched into a single HIT. For each pair of
records, the crowd needs to verify whether they refer to the 
same entity or not. Figure 3 shows an example of the user 
interface we generate for a pair-based HIT on AMT. At the 
top, there is a brief description of the HIT. More detailed
instructions can be displayed by clicking "Show Instructions".
The HIT shown consists of two pairs of records. For each
pair, the worker needs to choose either "They are the same
product" or "They are different products". The HIT can be
submitted only if a selection has been made for all pairs of 
records in the HIT. Note that in Figure 3, the second pair 
of records has not been verified, so the submit button is dis- 
abled, and its caption shows "1 left". We also recommend 
(but do not require) that workers provide reasons for their 
choices. 

Generating pair-based HITs is straightforward. Suppose
a pair-based HIT can contain at most k pairs. Given a set
of pairs, V, we need to generate $\lceil |V|/k \rceil$ pair-based HITs. For
example, for the ten pairs of records with above-threshold
likelihood in Figure 2(a), if k = 2, we would need to generate
five pair-based HITs.
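
Pair-based HIT generation amounts to chunking the list of pairs. A minimal sketch, with an illustrative function name and the pairs listed in the order of Figure 2(a):

```python
from math import ceil


def pair_based_hits(pairs, k):
    """Split the list of pairs into consecutive HITs of at most k pairs each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [pairs[i:i + k] for i in range(0, len(pairs), k)]


# The ten above-threshold pairs, in the order shown in Figure 2(a).
PAIRS = [("r1", "r2"), ("r4", "r6"), ("r1", "r7"), ("r3", "r4"), ("r4", "r7"),
         ("r8", "r9"), ("r2", "r3"), ("r2", "r7"), ("r3", "r5"), ("r4", "r5")]

hits = pair_based_hits(PAIRS, k=2)
print(len(hits), ceil(len(PAIRS) / 2))  # both are 5
```

The first HIT produced this way holds (r1, r2) and (r4, r6), matching the first HIT in Figure 2(b).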

3.2 Cluster-based HIT Generation 

A cluster-based HIT consists of a group of individual
records rather than pairs. Workers are asked to find all
duplicate records in the group. Figure 4 shows the user
interface we generate for a cluster-based HIT on AMT. As
with pair-based HITs, there is a brief description at the top,
and more detailed instructions are available. The example
HIT contains four records. A drop-down list at the front
of each record allows a worker to assign each record a label.
Initially, all records are unlabeled. When a label is
selected for a record, the background color of the record is
changed to the corresponding color for that label. Workers
indicate duplicate records by assigning them the same label
(and thus, the same color).

Figure 3: A pair-based HIT with two record pairs.
[Screenshot of the AMT interface "Decide Whether Two Products
Are the Same", with a "Show Instructions" link. Each product
pair lists Product Name and Price for its two records, a
required choice between "They are the same product" and
"They are different products", and an optional field for
reasons. The button at the bottom reads "Submit (1 left)".]

To make the labeling process more efficient, our interface 
supports two additional features, (1) sorting records by col- 
umn values by clicking a column header; (2) moving a record 
by dragging and dropping it. The first feature can be used 
for example, to sort the records based on a specific attribute 
such as product price. The second feature can be used, for 
example, to place the records that share a common word, 
e.g. "ipad" near each other for easier comparison. 

Next we study how to generate cluster-based HITs. A 
cluster-based HIT allows a pair of records to be matched iff 
both records are in the HIT. In a crowdsourced system like 
AMT, payment is made for each successfully completed HIT. 
Thus, there is a financial incentive to minimize the number 
of HITs. However, placing too many records in a cluster-based
HIT makes it difficult for workers to complete, resulting
in higher latencies and lower-quality answers. Thus,
we bound the number of records placed in a cluster-based 
HIT. We can then formulate the HIT generation problem as 
follows: 

Definition 1 (Cluster-based HIT Generation).
Given a set of pairs of records, V, and a cluster-size
threshold, k, the cluster-based HIT generation problem is
to generate the minimum number of cluster-based HITs,
$H_1, H_2, \ldots, H_h$, that satisfy two requirements: (1) $|H_\ell| \le k$
for any $\ell \in [1, h]$, where $|H_\ell|$ denotes the number of records
in $H_\ell$; (2) for any $(r_i, r_j) \in V$, there exists $H_\ell$ ($\ell \in [1, h]$)
s.t. $r_i \in H_\ell$ and $r_j \in H_\ell$.

Figure 4: A cluster-based HIT with four records.
[Screenshot of the AMT interface "Find Duplicate Products In
the Table", with a "Show Instructions" link and the tips
"(1) SORT the table by clicking headers; (2) MOVE a row by
dragging and dropping it". A table with a label drop-down,
Product Name and Price columns lists four records, followed
by an optional "Reasons for Your Answers" field and a
"Submit (1 left)" button.]

For example, consider the ten pairs of records in
Figure 2(a). Given the cluster-size threshold k = 4, suppose
we generate three cluster-based HITs, H1 = {r1, r2, r3, r7},
H2 = {r3, r4, r5, r6} and H3 = {r4, r7, r8, r9}. As their sizes
are no larger than k = 4, the first requirement of Definition 1
holds. Each of the ten pairs is contained in at least one of
the three cluster-based HITs, thus the second requirement
of Definition 1 holds. Furthermore, it is impossible to
find fewer cluster-based HITs that satisfy the two
requirements. Therefore, based on Definition 1, H1, H2 and H3 are
the solution to the cluster-based HIT generation problem.
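
The two requirements of Definition 1 are easy to check mechanically. The sketch below, with illustrative function names, verifies the three HITs of the example and also computes a simple counting lower bound: every record that appears in some pair must be placed in at least one HIT of at most k records, and a HIT of k records covers at most k(k-1)/2 pairs. The bound is an addition for illustration, not part of the original argument; in this example it equals 3 because nine records cannot fit into two HITs of four records.

```python
from math import ceil, comb


def is_valid_solution(hits, pairs, k):
    """Check Definition 1: every HIT has at most k records and every pair is covered."""
    size_ok = all(len(hit) <= k for hit in hits)
    covered = all(any(a in hit and b in hit for hit in hits) for a, b in pairs)
    return size_ok and covered


def lower_bound(pairs, k):
    """Counting lower bound on the number of cluster-based HITs (requires k >= 2)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    records = {r for pair in pairs for r in pair}
    by_records = ceil(len(records) / k)            # each HIT holds at most k records
    by_pairs = ceil(len(set(pairs)) / comb(k, 2))  # each HIT covers at most C(k, 2) pairs
    return max(by_records, by_pairs)


PAIRS = [("r1", "r2"), ("r4", "r6"), ("r1", "r7"), ("r3", "r4"), ("r4", "r7"),
         ("r8", "r9"), ("r2", "r3"), ("r2", "r7"), ("r3", "r5"), ("r4", "r5")]
HITS = [{"r1", "r2", "r3", "r7"}, {"r3", "r4", "r5", "r6"}, {"r4", "r7", "r8", "r9"}]

print(is_valid_solution(HITS, PAIRS, k=4), lower_bound(PAIRS, k=4))
```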

Unfortunately, the cluster-based HIT generation problem 
is NP-Hard. In the next section, we present an approxima- 
tion algorithm for this problem. 

Theorem 1. The cluster-based HIT generation problem 
is NP-Hard. 

Proof. We prove it by reduction from the k-clique covering
problem [15]. A k-clique is defined as a complete graph
that contains k vertices. We say that a k-clique covers an
edge of a graph if the two vertices of the edge are both in the
clique. Given a graph, the k-clique covering problem is to
find the minimum number of k-cliques to cover all the edges
of the graph. To reduce this problem to the cluster-based
HIT generation problem, we take each vertex of the graph
as a record, and construct a set of record pairs, V, that
consists of all edges of the graph. Let $H_1, H_2, \ldots, H_h$ be the
solution to the reduced cluster-based HIT generation problem.
Next we show that based on $H_1, H_2, \ldots, H_h$, we can
obtain the solution to the original k-clique covering problem
in polynomial time.

For each $H_\ell$ ($\ell \in [1, h]$), we generate a clique, $C_\ell$, that
consists of the vertices corresponding to the records in $H_\ell$.
Obviously, $C_1, C_2, \ldots, C_h$ are the minimum number of cliques
that can cover all the edges of the graph. Since $|H_\ell| \le k$, $C_\ell$
contains no more than k vertices. For each $C_\ell$ ($\ell \in [1, h]$),
we can simply construct a k-clique, $C'_\ell$, by adding $k - |H_\ell|$
vertices into $C_\ell$, and finally obtain the solution to the
k-clique covering problem, i.e. $C'_1, C'_2, \ldots, C'_h$. Therefore, the
k-clique covering problem can be reduced to the cluster-based
HIT generation problem in polynomial time. □

4. APPROXIMATION ALGORITHM 

In this section, we first reduce our problem to the k-clique 
edge covering problem, and then apply its approximation 
algorithm to cluster-based HIT generation. 
Figure 5: Build a graph based on the ten pairs of
records to be verified in Figure 2(a).
[The graph has nine vertices, r1 to r9, and one edge for each
of the ten pairs: (r1, r2), (r4, r6), (r1, r7), (r3, r4),
(r4, r7), (r8, r9), (r2, r3), (r2, r7), (r3, r5), (r4, r5).]

In order to reduce our HIT generation problem to the
k-clique covering problem, we build a graph based on the set
of pairs that need to be verified. In the graph, each vertex
represents a record, and each edge denotes a pair of records.
A cluster-based HIT can be seen as a clique in the graph. We
say the cluster-based HIT is able to check a pair if and only
if the clique is able to cover the corresponding edge. (For
simplicity, a cluster-based HIT is mentioned interchangeably
with its corresponding clique in later text.) Therefore, the
cluster-based HIT generation problem is reduced to finding
the minimum number of cliques, whose sizes are no larger
than k, to cover all edges of the graph. Note that we only
need to consider the cliques whose sizes are equal to k (i.e.
k-cliques) since a larger clique can cover more edges than a
smaller one. Therefore, the reduced problem is the same as
the k-clique covering problem.

To solve the k-clique covering problem, we can model it
as the set covering problem, and apply the corresponding
approximation algorithm [8]. However, that algorithm is
very expensive since it needs to generate $\binom{n}{k}$ covering cliques,
where n is the number of vertices in the graph. To
address this problem, Goldschmidt et al. [15] proposed an
efficient $\left(\frac{k}{2} + \frac{k}{k-1}\right)$-approximation algorithm. The algorithm
consists of two phases.

Phase 1: The algorithm creates a sequence consisting of
all the vertices and edges in the graph, denoted by
$SEQ = \{e_1, e_2, \ldots, e_n\}$, where n here denotes the number of
elements in SEQ. Initially, SEQ is empty. Then the algorithm
iteratively selects a vertex, adds the vertex and
the edges that contain the vertex into SEQ, and removes
them from the graph. The algorithm iterates until the graph
has no vertices or edges.

Phase 2: Next, SEQ is used to generate k-cliques to cover
the edges of the graph. SEQ has the useful property that
for any subsequence with k - 1 consecutive elements, i.e.,
$\{e_i, e_{i+1}, \ldots, e_{i+k-2}\}$, the edges in the subsequence contain
at most k different vertices [15]. Therefore, these edges can
be covered by a k-clique. Based on this property, the algorithm
divides SEQ into $\lceil n/(k-1) \rceil$ subsequences, where each
subsequence has k - 1 elements, and then finds a k-clique
for each subsequence. Since all the edges of the graph are
in SEQ, the algorithm can find $\lceil n/(k-1) \rceil$ k-cliques to cover
all the edges.

Example 2 shows how to use the above approximation al- 
gorithm to solve the cluster-based HIT generation problem. 

Example 2. Consider the ten pairs in Figure 2(a). To
generate cluster-based HITs for them, we first build a graph
as shown in Figure 5. The graph contains ten edges, which
represent the ten pairs. Next we create a sequence SEQ
which consists of all vertices and edges in the graph. Since
the graph contains ten edges and nine vertices, there will be
nineteen elements in SEQ. Suppose the cluster-size threshold
is k = 4. We divide SEQ into $\lceil 19/3 \rceil = 7$ subsequences,
where each subsequence (except the last one) contains
k - 1 = 3 elements. We generate a cluster-based HIT
to cover the edges in each subsequence. Therefore, the
approximation algorithm generates seven cluster-based HITs to
verify the ten pairs in Figure 2(a).
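
The two phases can be sketched in Python as follows. This is only an illustration of the description above, not the implementation from [15]: it assumes vertices are visited in sorted order (the description leaves the order open), and the function names are hypothetical.

```python
def build_sequence(pairs):
    """Phase 1: emit each vertex followed by its edges that are still in the graph."""
    vertices = sorted({v for pair in pairs for v in pair})
    remaining = list(pairs)
    seq = []
    for v in vertices:
        seq.append(("vertex", v))
        seq.extend(("edge", e) for e in remaining if v in e)
        remaining = [e for e in remaining if v not in e]  # remove emitted edges
    return seq


def sequence_hits(pairs, k):
    """Phase 2: cut SEQ into runs of k - 1 elements; one HIT per run."""
    if k < 2:
        raise ValueError("k must be at least 2")
    seq = build_sequence(pairs)
    hits = []
    for i in range(0, len(seq), k - 1):
        records = set()
        for kind, item in seq[i:i + k - 1]:
            if kind == "edge":
                records.update(item)  # records touched by the edges in this run
        hits.append(records)
    return hits


PAIRS = [("r1", "r2"), ("r4", "r6"), ("r1", "r7"), ("r3", "r4"), ("r4", "r7"),
         ("r8", "r9"), ("r2", "r3"), ("r2", "r7"), ("r3", "r5"), ("r4", "r5")]

hits = sequence_hits(PAIRS, k=4)
print(len(build_sequence(PAIRS)), len(hits))  # 19 elements, 7 subsequences
```

With this particular vertex order the last subsequence contains only a vertex, so its HIT is empty and an implementation could drop it. Even so, the result uses more HITs than necessary for this example.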

Note, however, that as described in Section 3.2 the op- 
timal solution requires only three cluster-based HITs. In 
fact, as we show in our experimental evaluation (Section 7), 
this approximation algorithm generates many more cluster- 
based HITs than even a naive algorithm on the data sets 
we tested. Thus, in the following section, we propose a new 
cluster-based HIT generation algorithm. 

5. A TWO-TIERED APPROACH 

In this section, we propose a two-tiered approach to ad- 
dress the cluster-based HIT generation problem. We first 
present an overview of our approach in Section 5.1, and then 
discuss the top tier and the bottom tier of our approach in 
Sections 5.2 and 5.3, respectively. 

5.1 Approach Overview 

Similar to the approximation algorithm in Section 4, our
approach first builds a graph on a set of pairs. Since our
hybrid human-machine workflow typically only needs to check
a small fraction of all possible pairs, the graph is very sparse,
and may consist of many connected components. We classify
these connected components into two types according
to the cluster-size threshold k. Large connected components
(LCCs) have more than k vertices while small connected
components (SCCs) have k vertices or fewer.
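
The classification step can be sketched with a plain depth-first search. The code below is an illustrative sketch (function names are not from the original) applied to the ten pairs of Figure 2(a) with k = 4.

```python
from collections import defaultdict


def connected_components(pairs):
    """Return the connected components of the graph whose edges are the given pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components


def classify(components, k):
    """Split components into small (at most k vertices) and large (more than k)."""
    small = [c for c in components if len(c) <= k]
    large = [c for c in components if len(c) > k]
    return small, large


PAIRS = [("r1", "r2"), ("r4", "r6"), ("r1", "r7"), ("r3", "r4"), ("r4", "r7"),
         ("r8", "r9"), ("r2", "r3"), ("r2", "r7"), ("r3", "r5"), ("r4", "r5")]

small, large = classify(connected_components(PAIRS), k=4)
print("SCCs:", small)
print("LCCs:", large)
```

For this input there is one LCC containing r1 through r7 and one SCC containing r8 and r9.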

LCCs have more vertices than can fit into a cluster-based 
HIT; they must be partitioned into SCCs. When partition- 
ing, we would like to create SCCs that are highly connected 
since doing so increases the number of edges covered by the 
component, enabling more comparisons to be done in a given 
cluster-based HIT. The number of HITs required can also 
be reduced by batching multiple SCCs into a single cluster- 
based HIT. Different packing methods can lead to different 
numbers of required HITs. It is important to note that the 
approximation algorithm in Section 4 does not consider ei- 
ther of these issues that impact the number of HITs, rather 
it simply adds a random vertex and its corresponding edges 
into SEQ. 

For example, consider the graph in Figure 5. It consists
of two connected components. Suppose k = 4.
The top one is an LCC and the bottom one is an SCC.
Consider the top component containing seven vertices
{r1, r2, r3, r4, r5, r6, r7}. Since it is an LCC, it must be
partitioned. Assume it is partitioned into three SCCs
{r1, r2, r3, r7}, {r3, r4, r5, r6}, {r4, r7}, which can cover all
of its edges. Then the graph becomes {r1, r2, r3, r7},
{r3, r4, r5, r6}, {r4, r7}, {r8, r9}.

Next, we need to pack these SCCs into cluster-based
HITs. One way is to create a cluster-based HIT for each
of {r3, r4, r5, r6} and {r1, r2, r3, r7}, and then to combine
{r4, r7} and {r8, r9} into a third cluster-based HIT. This
way, we generate only three cluster-based HITs.
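
To make the packing idea concrete, the sketch below packs the four SCCs of the example with a first-fit-decreasing heuristic. This heuristic is used here purely as a stand-in for illustration; it is an assumption of this sketch and should not be read as the bottom-tier method of the approach. The function name is hypothetical.

```python
def pack_first_fit(sccs, k):
    """Pack SCCs into HITs of at most k records, largest SCC first (first-fit decreasing)."""
    hits = []
    for scc in sorted(sccs, key=len, reverse=True):
        for hit in hits:
            if len(hit | scc) <= k:  # the combined record set still fits
                hit |= scc           # merge in place
                break
        else:
            hits.append(set(scc))    # no existing HIT has room: open a new one
    return hits


SCCS = [{"r1", "r2", "r3", "r7"}, {"r3", "r4", "r5", "r6"}, {"r4", "r7"}, {"r8", "r9"}]

for hit in pack_first_fit(SCCS, k=4):
    print(sorted(hit))
```

On this input the two four-record SCCs each get their own HIT, and {r4, r7} and {r8, r9} share a third, giving three HITs as in the example.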
Algorithm 1: Two-Tiered(V, k)

Input:  V : a set of pairs of records
        k : a cluster-size threshold
Output: H_1, H_2, ..., H_h : cluster-based HITs

1 begin
2   Let CC denote the connected components of the graph
    that is built based on V;
3   SCC = {cc ∈ CC | |cc| ≤ k};   // Small Connected Components
4   LCC = {cc ∈ CC | |cc| > k};   // Large Connected Components
5   SCC ∪= PARTITIONING(LCC, k);  // Top Tier
6   H_1, H_2, ..., H_h = PACKING(SCC, k);  // Bottom Tier
7 end

Figure 6: An overview of two-tiered approach.

Figure 6 shows the pseudo-code for this approach. In 
the initial step, we build a graph based on the given set of 
pairs, and divide the connected components of the graph 
into LCCs and SCCs (Lines 2-4). We then partition each 
of the LCCs into small ones so that we have a collection of 
SCCs (Line 5). Finally, we pack all the SCCs into cluster- 
based HITs (Line 6). 

5.2 LCC Partitioning (Top Tier) 

We now study the top tier of our approach, that is, given 
an LCC, how to partition it into SCCs such that its edges 
can be covered by these SCCs. As discussed in Section 5.1, 
we aim to create SCCs that are highly connected. Based 
on this idea, we devise a greedy algorithm. To partition an 
LCC, the algorithm iteratively generates an SCC with the 
highest connectivity, and iterates until the generated SCCs 
cover all edges in the large one. 

In each iteration step, the algorithm first initializes a small 
connected component sec with the vertex having the max- 
imum degree in the LCC. Then the algorithm repeats to 
add a new vertex into sec that maximizes the connectivity 
of sec. More specifically, for each vertex r (^ sec), the al- 
gorithm computes the indegree and the outdegree of r w.r.t 
sec, where the indegree is defined as the number of edges 
between r and the vertices in sec, and the outdegree is de- 
fined as the number of edges between r and the vertices not 
in sec. 

We select the vertex with the maximum indegree, as that
adds the most edges to scc. If there is a tie, that is, more
than one vertex has the same maximum indegree, the algorithm
selects the vertex with the minimum outdegree among them,
since vertices with a larger outdegree have more connectivity
with the vertices outside scc. The algorithm adds the selected
vertex into scc, updates the indegree and the outdegree of each
vertex w.r.t the new scc, and repeats the above process to
select another vertex. The algorithm stops adding vertices to
scc when the size of scc is equal to the cluster-size threshold
k, or when no remaining vertex connects with scc.
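The greedy procedure described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name is hypothetical, ties that remain after the indegree/outdegree rule are broken by vertex order (an assumption), vertices are assumed to be sortable, and the graph is assumed to have no self-loops.

```python
from collections import defaultdict


def partition_lcc(edges, k):
    """Greedily cover the edges of one LCC with vertex sets of size <= k."""
    if k < 2:
        raise ValueError("k must be at least 2 to cover any edge")
    remaining = {frozenset(e) for e in edges}
    sccs = []
    while remaining:
        # Adjacency over the edges that are not yet covered.
        adj = defaultdict(set)
        for a, b in map(sorted, remaining):
            adj[a].add(b)
            adj[b].add(a)
        # Seed with a vertex of maximum degree.
        seed = max(sorted(adj), key=lambda v: len(adj[v]))
        scc = {seed}
        conn = set(adj[seed])

        def score(v):
            indegree = len(adj[v] & scc)   # edges into the current scc
            outdegree = len(adj[v] - scc)  # edges leaving the current scc
            return (-indegree, outdegree)  # max indegree, then min outdegree

        while len(scc) < k and conn:
            new = min(sorted(conn), key=score)
            conn.remove(new)
            scc.add(new)
            conn |= adj[new] - scc
        sccs.append(scc)
        # Drop every edge whose two endpoints both lie in scc.
        remaining = {e for e in remaining if not e <= scc}
    return sccs


print(partition_lcc([(1, 2), (1, 3), (2, 3), (3, 4)], k=3))
```

On this small graph (a triangle with one pendant vertex) the sketch returns two vertex sets, {1, 3, 4} and {1, 2, 3}, which together cover all four edges.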

Figure 7 shows the pseudo-code for the top tier. Each
large connected component, lcc, is partitioned into SCCs
as follows. First, it creates a small connected component,
scc, with the vertex that has the maximum degree in lcc
(Line 5). Let conn denote a set of vertices that connect
with scc (Line 6). Next, the algorithm repeatedly picks up
a vertex from conn and adds it into scc until either the size
of scc is k or conn is empty (Lines 7-12). When picking up a
vertex, it aims to maximize the connectivity of scc (Line 8).
After adding a new vertex into scc (Line 9), the algorithm



Algorithm 2: Partitioning(LCC, k)

Input:  LCC : a set of large connected components
        k : a cluster-size threshold
Output: SCC : a set of small connected components obtained by
        partitioning each large connected component in LCC

 1 begin
 2   for each lcc ∈ LCC do
 3     while lcc has edges do
 4       Let r_max be the vertex of lcc with the maximum degree;
 5       scc = {r_max};
 6       conn = {r | for each edge (r_max, r) of lcc};
 7       while |scc| < k and |conn| > 0 do
 8         Pick up a vertex r_new from conn with the maximum
           indegree w.r.t scc. (If there is a tie, pick up the
           one with the minimum outdegree w.r.t scc);
 9         Move r_new from conn to scc;
10         for each edge (r_new, r) of lcc do
11           if r ∉ scc and r ∉ conn then
12             Add r into conn;
13       Add scc into SCC;
14       Remove the edges of lcc that are covered by scc;
15 end

Figure 7: The top tier of our approach.

needs to update conn (Lines 10-12). Finally, the algorithm
outputs scc and removes the edges of lcc that are covered
by scc (Lines 13-14). If lcc still has edges, the algorithm
returns to the first step and continues to generate more SCCs
(Line 3).

Figure 8: An example of using the top tier of our
approach to partition a large connected component
in Figure 5 (k = 4).

Example 3. Consider the LCC in Figure 5. To partition it
into SCCs, in Figure 8(a) we first initialize scc = {$r_4$}
with a vertex that has the maximum degree. Then we repeatedly
add a new vertex into scc until its size reaches the
cluster-size threshold (i.e., |scc| = k) or no more vertices
are available (i.e., |conn| = 0). In Figure 8(b), there are
four vertices connecting with scc, thus
conn = {$r_3$, $r_5$, $r_6$, $r_7$}. For each vertex, we
compute its indegree and outdegree w.r.t scc, denoted by
(indegree, outdegree). Since $r_3$, $r_5$, $r_6$ and $r_7$
have the same indegree while $r_6$ has the minimum outdegree,
we add $r_6$ into scc. Similarly, in Figure 8(c) and (d), we
respectively add $r_5$ and $r_3$ into scc.
At this point, if we add more vertices, |scc| will be larger
than k, thus in Figure 8(e), we output scc and remove its
edges from the LCC. We use a similar method to partition
the remainder of the large connected component, and obtain
two other SCCs in Figure 8(f). Thus, the LCC is ultimately
partitioned into three SCCs: {$r_3$, $r_4$, $r_5$, $r_6$},
{$r_1$, $r_2$, $r_3$, $r_7$} and {$r_4$, $r_7$}.

5.3 SCC Packing (Bottom Tier) 

We now describe the bottom tier of our approach, that is, 
given a set of SCCs, how to pack them into the minimum 
number of cluster-based HITs such that the size of each 
cluster-based HIT is no larger than k. This is an NP-Hard
problem, a variant of the one-dimensional cutting-stock
problem [14]. We formulate it as an integer linear
program.

Let $p = [a_1, a_2, \ldots, a_k]$ denote a pattern of a
cluster-based HIT, where $a_j$ ($j \in [1, k]$) is the number
of SCCs in the HIT that contain $j$ vertices. Since a
cluster-based HIT can contain at most $k$ vertices, we say that
$p$ is a feasible pattern only if $\sum_{j=1}^{k} j \cdot a_j \le k$
holds. For example, suppose $k = 4$. $p_1 = [0, 0, 0, 1]$ is a
feasible pattern since $1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot 1 = 4 \le k$
holds. We collect all feasible patterns into a set
$A = \{p_1, p_2, \ldots, p_m\}$, where
$p_i = [a_{i1}, a_{i2}, \ldots, a_{ik}]$ ($i \in [1, m]$).
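To make the definition concrete, the following sketch enumerates all feasible patterns for a given k. The function name is hypothetical, and the sketch assumes that the all-zero pattern (an empty HIT) is not of interest and can be left out.

```python
def feasible_patterns(k):
    """Yield every non-empty pattern [a_1, ..., a_k] with sum(j * a_j) <= k."""

    def extend(j, budget, prefix):
        if j > k:
            if any(prefix):  # skip the empty pattern (assumption)
                yield list(prefix)
            return
        # a_j SCCs of size j consume j * a_j of the remaining vertex budget.
        for a_j in range(budget // j + 1):
            yield from extend(j + 1, budget - j * a_j, prefix + [a_j])

    yield from extend(1, k, [])


patterns = list(feasible_patterns(4))
print(len(patterns))
print([0, 0, 0, 1] in patterns, [0, 0, 1, 1] in patterns)
```

For k = 4 this yields 11 patterns. The number of patterns grows quickly with k, which is one reason a solver that generates patterns lazily is attractive.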

When packing a set of SCCs into cluster-based HITs,
each HIT must correspond to a pattern in $A$. Let $x_i$ denote
the number of cluster-based HITs whose pattern is $p_i$
($i \in [1, m]$). Then the problem becomes how to minimize
the total number of patterns, i.e., $\sum_{i=1}^{m} x_i$. Based
on this idea, we can formulate our packing problem as the
following integer linear program:

$$\min \sum_{i=1}^{m} x_i$$

$$\text{s.t.} \quad \sum_{i=1}^{m} a_{ij} x_i \ge c_j, \quad \forall j \in [1, k]$$

$$x_i \ge 0, \ \text{integer}$$

where $c_j$ is the total number of the small connected
components that contain $j$ vertices.

For example, given a set of SCCs, {$r_3$, $r_4$, $r_5$, $r_6$},
{$r_1$, $r_2$, $r_3$, $r_7$}, {$r_4$, $r_7$} and {$r_8$, $r_9$},
we have $c_1 = 0$, $c_2 = 2$, $c_3 = 0$ and $c_4 = 2$. To pack
them into cluster-based HITs ($k = 4$), we first generate all
feasible patterns, i.e.
$A = \{p_1 = [0, 0, 0, 1], p_2 = [0, 2, 0, 0], p_3 = [0, 1, 0, 0]\}$.
(Note that since $c_1 = 0$ and $c_3 = 0$, we omit the feasible
patterns in $A$ whose first or third dimension contains
non-zero values.)

Next we need to decide the number of cluster-based HITs
corresponding to each feasible pattern, i.e. $x_1$, $x_2$ and
$x_3$. One possible solution is $x_1 = 2$, $x_2 = 0$ and
$x_3 = 2$, which needs $\sum_{i=1}^{3} x_i = 4$ cluster-based
HITs. We can easily verify that the solution satisfies the
constraint condition, i.e. $\sum_{i=1}^{3} a_{ij} x_i \ge c_j$
($\forall j \in [1, 4]$). For example, when $j = 2$, we have
$\sum_{i=1}^{3} a_{i2} x_i = 0 \cdot 2 + 2 \cdot 0 + 1 \cdot 2 \ge c_2 = 2$.
However, this solution is not optimal since there is another
solution $x_1 = 2$, $x_2 = 1$ and $x_3 = 0$ that also satisfies
the constraint condition but needs only
$\sum_{i=1}^{3} x_i = 3$ cluster-based HITs.
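The worked example can be checked mechanically. The sketch below verifies the constraint for the two candidate solutions and then finds the minimum number of HITs with a small exhaustive branch-and-bound over bins. Note that this search is a stand-in chosen for brevity, not the column-generation method cited in the text, and all function names are hypothetical.

```python
def satisfies(patterns, x, c):
    """Check sum_i a_ij * x_i >= c_j for every SCC size j."""
    return all(
        sum(p[j] * x_i for p, x_i in zip(patterns, x)) >= c_j
        for j, c_j in enumerate(c)
    )


def min_hits(sizes, k):
    """Exact minimum number of size-k HITs needed to pack SCCs of `sizes`."""
    sizes = sorted(sizes, reverse=True)
    if any(s > k for s in sizes):
        raise ValueError("every SCC must fit into one HIT")
    best = len(sizes)  # one HIT per SCC always works

    def place(i, loads):
        nonlocal best
        if len(loads) >= best:
            return  # cannot improve on the best packing found so far
        if i == len(sizes):
            best = len(loads)
            return
        tried = set()
        for b in range(len(loads)):
            # Skip bins with a load already tried (symmetric choices).
            if loads[b] + sizes[i] <= k and loads[b] not in tried:
                tried.add(loads[b])
                loads[b] += sizes[i]
                place(i + 1, loads)
                loads[b] -= sizes[i]
        loads.append(sizes[i])  # open a new HIT for this SCC
        place(i + 1, loads)
        loads.pop()

    place(0, [])
    return best


patterns = [[0, 0, 0, 1], [0, 2, 0, 0], [0, 1, 0, 0]]
c = [0, 2, 0, 2]
print(satisfies(patterns, [2, 0, 2], c), sum([2, 0, 2]))  # True 4
print(satisfies(patterns, [2, 1, 0], c), sum([2, 1, 0]))  # True 3
print(min_hits([4, 4, 2, 2], k=4))  # 3
```

The exhaustive search is exponential in the worst case and is only meant to confirm small examples such as this one.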

The above integer linear program can be solved by using
column generation and branch-and-bound [25]. The technique
is very efficient as it does not need to generate all
feasible patterns at the beginning. Instead, it starts with
a few patterns and generates more patterns as needed. At
each iteration, a branch-and-bound tree is built to search
for the optimal integer solution.

6. BACK OF THE ENVELOPE ANALYSIS 

In this section, we compare pair-based HITs with cluster- 
based HITs analytically in terms of the number of compar- 
isons they require workers to perform. For a pair-based HIT 
the number of comparisons required is simply the number of 
pairs that have been batched into the HIT. This is because 
each pair in the HIT is treated separately. For cluster-based 
HITs the story is more involved. 

Consider a cluster-based HIT with $n$ records. One may
be tempted to think that $\frac{n \cdot (n-1)}{2}$ comparisons
would be required. However, in reality, the number of
comparisons also depends on the way that a person does the HIT
and the number of distinct entities represented in the HIT.
Suppose the cluster-based HIT contains $m$ distinct entities,
denoted by $e_1, e_2, \ldots, e_m$, where $e_i$ ($i \in [1, m]$)
represents the set of records in the HIT that refer to the
$i$-th entity. Obviously, $\sum_{i=1}^{m} |e_i| = n$.

Assume a person does the cluster-based HIT as follows.
First, she picks up a record from an entity, e.g. $e_1$. Then
she compares the record with the other $n - 1$ records. After
$n - 1$ comparisons, she can identify all the records in the
cluster-based HIT that refer to $e_1$. Next she selects a record
from another entity, e.g. $e_2$. Note that she does not need
to compare it with the records in $e_1$ since those records
correspond to the first entity and cannot refer to the second
(different) entity. She can then identify all the records that
refer to $e_2$ with $n - 1 - |e_1|$ comparisons. Iteratively, when
selecting a record from $e_i$, she only needs to compare with
$n - 1 - \sum_{j=1}^{i-1} |e_j|$ other records. By summing the
number of comparisons in each iteration step, we can obtain the
total number of comparisons required to complete the
cluster-based HIT, i.e.,

$$\sum_{i=1}^{m} \Big( n - 1 - \sum_{j=1}^{i-1} |e_j| \Big). \qquad (1)$$

Based on this equation, we have the following two obser- 
vations. 

First, the value of Equation 1 decreases as $|e_j|$
($j \in [1, m]$) increases. That is, a cluster-based HIT
requires fewer comparisons when it contains more matches (i.e.,
duplicates). To illustrate this observation, consider two
extreme cases. In the first, no duplicate record exists in the
cluster-based HIT. In this case, there are $n$ entities in the
HIT and each entity contains only one record, thus the number
of comparisons becomes $\frac{n \cdot (n-1)}{2}$. In the other
case, all records in the cluster-based HIT are duplicates.
Here there is only one entity in the HIT, the entity contains
$n$ records, and the number of comparisons required is $n - 1$.
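Equation 1 is easy to evaluate directly. The following sketch (with a hypothetical function name) takes the entity sizes in the order in which the worker identifies them and reproduces the two extreme cases discussed above.

```python
def comparisons(entity_sizes):
    """Equation 1: comparisons when entities are identified in the given order."""
    n = sum(entity_sizes)
    total, identified = 0, 0
    for size in entity_sizes:
        # The picked record is compared with every record not yet identified.
        total += n - 1 - identified
        identified += size
    return total


n = 5
print(comparisons([1] * n))  # no duplicates: n * (n - 1) / 2 = 10
print(comparisons([n]))      # all duplicates: n - 1 = 4
```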

The second observation is that the value of Equation 1
depends on the order in which the entities are identified. For
ease of presentation, we rewrite Equation 1 in the following
equivalent form.

$$(n-1) \cdot m - \sum_{i=1}^{m-1} (m-i) \cdot |e_i|. \qquad (2)$$

The first part of the equation, i.e. $(n-1) \cdot m$, is a
constant w.r.t the order of the entities, while the second
part, i.e. $\sum_{i=1}^{m-1} (m-i) \cdot |e_i|$, is a weighted
sum of $|e_i|$ ($i \in [1, m-1]$). Since the weight $(m-i)$
decreases with increasing $i$ and the weighted sum is
subtracted, the number of comparisons is minimized when the
largest entities receive the largest weights, that is, when
entities are identified in decreasing order of $|e_i|$.
Conversely, the worst way to identify entities is in increasing
order of $|e_i|$.

Figure 9: An illustration of computing the number
of comparisons for a cluster-based HIT.

Example 4 shows how to use the above method to compute 
the number of comparisons for a cluster-based HIT. 



Table 2: Likelihood-threshold selection.

(a) Restaurant Dataset

Threshold   Total #Pair   Matches   Recall
0.5         161           83        78.3%
0.4         755           99        93.4%
0.3         4,788         105       99.1%
0.2         23,944        106       100%
0.1         83,117        106       100%
0           367,653       106       100%

(b) Product Dataset

Threshold   Total #Pair   Matches   Recall
0.5         637           335       30.5%
0.4         1,427         571       52.1%
0.3         3,154         805       73.4%
0.2         8,315         1,011     92.2%
0.1         37,641        1,090     99.4%
0           1,180,452     1,097     100%


Example 4. In Figure 9, consider a cluster-based HIT
with $n = 4$ records, {$r_1$, $r_2$, $r_3$, $r_7$}. From Table 1,
we can see that $r_1$, $r_2$ and $r_7$ refer to the same entity,
thus $e_1$ = {$r_1$, $r_2$, $r_7$} and $e_2$ = {$r_3$}. Figure 9
shows the way a human worker does the HIT. First, the worker
initializes an entity $e_1$ by selecting a record $r_1$. Then
she compares $r_1$ with the other $n - 1 = 3$ records. Since
$r_2$ and $r_7$ refer to the same entity as $r_1$, she adds them
into $e_1$ by painting them the same color as $r_1$. We can see
that after three comparisons, all of the records that refer to
$e_1$ have been identified. Next, the worker initializes
another entity $e_2$ by selecting record $r_3$. Since no record
is left in the cluster-based HIT, i.e. $n - 1 - |e_1| = 0$,
there is no need to compare $r_3$ with any record. Therefore,
the cluster-based HIT requires only three comparisons in
total. One interesting observation is that the worker actually
checks four pairs of records, ($r_1$, $r_2$), ($r_1$, $r_7$),
($r_2$, $r_3$) and ($r_2$, $r_7$), using only three comparisons.
A pair-based HIT would require four comparisons. This shows
that cluster-based HITs can require fewer comparisons than
pair-based HITs.
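The numbers in Example 4 can be reproduced with a short script. The sketch below (function names are hypothetical) evaluates both Equation 1 and Equation 2 for the entity sizes of the example, and also shows how the count changes with the order in which entities are identified.

```python
from itertools import permutations


def comparisons_eq1(sizes):
    """Equation 1: sum over entities of (n - 1 - records already identified)."""
    n, total, identified = sum(sizes), 0, 0
    for size in sizes:
        total += n - 1 - identified
        identified += size
    return total


def comparisons_eq2(sizes):
    """Equation 2: (n - 1) * m minus the weighted sum of entity sizes."""
    n, m = sum(sizes), len(sizes)
    return (n - 1) * m - sum((m - i) * sizes[i - 1] for i in range(1, m))


# Example 4: e_1 has three records and is identified first, e_2 has one.
print(comparisons_eq1([3, 1]), comparisons_eq2([3, 1]))  # 3 3
# Identifying the single-record entity first costs more comparisons.
print(comparisons_eq1([1, 3]), comparisons_eq2([1, 3]))  # 5 5
# Cheapest identification order for three entities of sizes 1, 2 and 4.
print(min(permutations([1, 2, 4]), key=comparisons_eq1))  # (4, 2, 1)
```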

7. EXPERIMENTAL RESULTS 

We conducted extensive experiments to evaluate our ap- 
proaches. We address three issues here. First, we examine 
the effectiveness of our two-tiered HIT generation approach 
in reducing the number of HITs required for Entity Res- 
olution on real data sets. Second, we compare the qual- 
ity of the results produced by our hybrid human-machine 
approach with that produced by two machine-based ap- 
proaches. Finally, we compare pair-based and cluster-based 
HITs in terms of both answer quality and latency. 

7.1 Experimental Setup 

Datasets: We used two real datasets to evaluate our method. 

Restaurant² is a data set consisting of 858 (non-identical)
restaurant records. It has $\frac{858 \cdot (858-1)}{2}$ = 367,653
pairs of records in total, among which 106 pairs refer to the
same entity. Each restaurant record has four attributes, [name,
address, city, type]. An example record is: ["oceana", "55 e.
54th St.", "newyork", "seafood"].



² http://www.cs.utexas.edu/users/ml/riddle/data/restaurant.tar.gz



Product³ is a product data set integrated from two different
sources. There are 1081 records coming from the abt website
and 1092 records coming from the buy website. The data set has
1081 × 1092 = 1,180,452 pairs of records, among which 1,097
pairs refer to the same entity. Each product record has two
attributes, [name, price]. An example record is: ["Apple 8GB
Black 2nd Generation iPod Touch - MB528LLA", "$229.00"].

The two datasets were preprocessed by replacing
non-alphanumeric characters with white spaces and converting
letters to lowercase.

Machine-based Technique: Our hybrid human-machine
workflow needs a machine-based technique to compute a
likelihood for each pair of records. In our experiment, a
simple similarity-based technique, called simjoin, was adopted
to achieve this goal. We first generated a token set for each
record, which consisted of the tokens from all attribute
values. Then for each pair of records, we took the Jaccard
similarity between their corresponding token sets as their
likelihood. Since our workflow only crowdsources the pairs
whose likelihood is above a threshold, Table 2 shows the
effect of different threshold choices on the two datasets.
For example, for the Restaurant dataset, a threshold of 0.5
retains 161 pairs, of which 83 actually refer to the same
entity. The recall is 83/106 = 78.3%, which means that 78.3%
of the 106 matching pairs in the data set pass the threshold.
From Table 2, we can conclude that a hybrid human-machine
workflow can utilize machine-based techniques to significantly
reduce the number of pairs with little loss of recall. For
instance, on the Product dataset, when the threshold is 0.2, we
can achieve up to 92.2% recall by having people examine only
8,315 pairs of records, which is over two orders of magnitude
fewer than the total number of pairs (1,180,452).
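A likelihood computation of this kind could look like the sketch below. It is an illustration under stated assumptions, not the simjoin implementation used in the experiments: the function names are hypothetical, tokenization is restricted to ASCII letters and digits, "above a threshold" is read as greater than or equal, and all pairs are compared naively. The second record in the usage example is made up for illustration.

```python
import re
from itertools import combinations


def tokens(record):
    """Token set of a record: lowercase, non-alphanumerics become spaces."""
    text = " ".join(record).lower()
    return set(re.sub(r"[^a-z0-9]+", " ", text).split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two records."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def candidate_pairs(records, threshold):
    """Index pairs whose likelihood reaches the threshold (naive all-pairs scan)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]


a = ["oceana", "55 e. 54th St.", "newyork", "seafood"]
b = ["oceana", "55 east 54th street", "new york", "seafood"]  # hypothetical variant
print(round(jaccard(a, b), 3))        # 4 shared tokens out of 11
print(candidate_pairs([a, b], 0.3))   # [(0, 1)]
```

A quadratic scan is fine for a sketch, but data sets of the size used here would normally call for a similarity-join algorithm that avoids comparing every pair.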

AMT: We use Amazon Mechanical Turk (AMT) to evalu- 
ate our hybrid human-machine workflow. AMT is a widely 
used crowdsourcing marketplace. We paid workers $0.02 for 
completing each HIT and paid AMT $0.005 for publishing
each HIT. We ran over 8000 HITs and spent about $600 on 
AMT to evaluate our methods. All the HITs were published 
between 1800 and 2400 PST. We ran each experiment three 

³ http://dbs.uni-leipzig.de/file/Abt-Buy.zip






times at the same time of day during the course of three 
days, and report the average performance. In addition, we 
used two ways to improve the result quality. (1) Assign- 
ment: AMT allows us to replicate one HIT into multiple 
assignments, and guarantees that each assignment can be 
done by a different worker. In our experiment, each HIT 
was replicated into 3 assignments. That is, we obtained the 
results of a HIT from three different workers and made our fi- 
nal decision based on a combination of the three results (see 
Section 7.3). (2) Qualification Test: We found that some 
workers may do our HITs maliciously. In order to prevent 
this, AMT supports qualification tests for workers, and only 
those who successfully pass the test can do our HITs. In our 
experiment, the qualification test consists of three pairs of 
records. For each one, a worker needs to decide whether or 
not they match. Workers must get all three pairs correct to 
pass the qualification test. 

7.2 Cluster-based HIT Generation 

In this section, we evaluate the two-tiered approach for 
cluster-based HIT generation. We compare with the follow- 
ing baseline algorithms: 

Random: The algorithm generates cluster-based HITs by
randomly selecting records from a set of pairs of records, V.
To generate a cluster-based HIT, H, it repeatedly selects a
pair of records from V and merges the two records into H.
When |H| = k, it outputs H, and removes the pairs from
V. If V still has pairs, the algorithm will repeat the above
process to generate new cluster-based HITs; otherwise, the
algorithm terminates.

BFS-based: The algorithm first builds a graph on a given
set of pairs of records, and then generates cluster-based HITs
according to the breadth-first-search (BFS) of the graph. To
generate a cluster-based HIT, H, the algorithm traverses the
graph using BFS, and adds the vertices (i.e. records) into
H in the traversal order. When |H| = k, it outputs H,
and removes the edges that can be covered by H from the
graph. If the graph still has edges, the algorithm will
repeat the above process to generate new cluster-based HITs;
otherwise, the algorithm terminates.

DFS-based: Similar to BFS-based but traverses the graph 
using depth-first-search (DFS). 

Approximation: The k-clique approximation algorithm 
described in Section 4. 
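As an illustration of the baselines, the BFS-based algorithm might be sketched as below. Several details are assumptions not fixed by the description above: the function name is hypothetical, the traversal starts from the smallest vertex and visits neighbors in sorted order, and when a component is exhausted before the HIT is full the traversal continues from another unvisited vertex.

```python
from collections import defaultdict, deque


def bfs_hits(pairs, k):
    """Generate cluster-based HITs of at most k records in BFS order."""
    if k < 2:
        raise ValueError("k must be at least 2 to cover any pair")
    remaining = {frozenset(p) for p in pairs}
    hits = []
    while remaining:
        # Rebuild the graph from the pairs that are still uncovered.
        adj = defaultdict(set)
        for a, b in map(sorted, remaining):
            adj[a].add(b)
            adj[b].add(a)
        hit, seen = [], set()
        for start in sorted(adj):
            if len(hit) == k:
                break
            if start in seen:
                continue
            seen.add(start)
            hit.append(start)
            queue = deque([start])
            while queue and len(hit) < k:
                v = queue.popleft()
                for w in sorted(adj[v]):
                    if w not in seen and len(hit) < k:
                        seen.add(w)
                        hit.append(w)
                        queue.append(w)
        hits.append(hit)
        covered = set(hit)
        # Remove every pair whose two records both appear in this HIT.
        remaining = {e for e in remaining if not e <= covered}
    return hits


print(bfs_hits([(1, 2), (2, 3), (3, 4), (4, 5)], k=3))  # [[1, 2, 3], [3, 4, 5]]
```

Replacing the queue with a stack would give a DFS-based variant in the same spirit.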

We first compare the two-tiered approach with the base- 
line algorithms for various likelihood thresholds. We varied 
the likelihood threshold from 0.5 to 0.1 on the Restaurant 
and Product datasets, and used the different approaches to 
generate cluster-based HITs. Figure 10 shows the number 
of generated cluster-based HITs (k = 10) as the threshold is 
varied. We can see that the two-tiered approach generated 
the fewest cluster-based HITs, with the differences being 
greater for smaller thresholds. Note that in order to achieve 
a high recall, we need to select a smaller threshold (Table 2). 

In terms of the baseline algorithms, we have the following
observations. First, the BFS-based algorithm was the best
baseline algorithm. This is because the BFS traversal of
the graph can generate cluster-based HITs with highly
connected vertices. Second, the approximation algorithm did
not perform well on the real datasets. For example, on the
Restaurant dataset, when the threshold is 0.1, it performed
worst. Third, the naive random algorithm generated many
more cluster-based HITs than the two-tiered approach. For
example, on the Product dataset, the random algorithm
generated 6422 HITs with threshold 0.1, while the two-tiered
approach generated only 2033 HITs.

Figure 10: Comparison of the number of cluster-based
HITs for various likelihood thresholds (cluster-size=10).

Next we compare the two-tiered approach with the baseline
algorithms for various cluster-size thresholds. We varied
the threshold from 5 to 20 on the Restaurant and Product
datasets, and compared the number of cluster-based HITs
generated by different approaches. Figure 11 shows the
results with likelihood threshold = 0.1. We can see that for
all cluster-size thresholds tested, our two-tiered approach
generated the minimum number of cluster-based HITs. For
example, on the Restaurant dataset the two-tiered approach
generated 1.9 to 2.3 times fewer cluster-based HITs than the
best baseline algorithm.

Figure 11: Comparison of the number of cluster-based
HITs for various cluster-size thresholds (likelihood
threshold=0.1).

7.3 Entity-Resolution Techniques 

In this section, we compare the hybrid human-machine 
workflow with existing entity resolution techniques. We use 
two metrics to evaluate the result quality: (1) precision is 
the percentage of correctly identified matching pairs out of 
all pairs identified as matches; (2) recall is the percentage 
of correctly identified matching pairs out of all matching 
pairs in the dataset. As more matches are identified, recall 
increases while precision potentially decreases. 

We show our results as precision-recall curves [4], gener- 
ated as follows. We assume the result of an entity-resolution 
technique is a ranked list of pairs of records, where the list is 
sorted based on the decreasing order of the likelihood that a 
pair of records match. In the list, the first n pairs are iden- 
tified as matching pairs. To plot the precision-recall curve, 
we vary n and plot the precision vs. the recall. 
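The curve construction just described amounts to a single pass over the ranked list. In the sketch below the function name is hypothetical, and pairs are assumed to be stored in a canonical form (for example, sorted tuples) so that membership tests against the ground truth are meaningful.

```python
def precision_recall_curve(ranked_pairs, true_matches):
    """Return one (recall, precision) point for each prefix length n."""
    true_matches = set(true_matches)
    if not true_matches:
        raise ValueError("recall is undefined without any true matches")
    points, hits = [], 0
    for n, pair in enumerate(ranked_pairs, start=1):
        if pair in true_matches:
            hits += 1
        # The first n pairs are treated as the identified matches.
        points.append((hits / len(true_matches), hits / n))
    return points


ranked = [("a", "b"), ("c", "d"), ("e", "f")]
truth = {("a", "b"), ("e", "f")}
for recall, precision in precision_recall_curve(ranked, truth):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```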

We implemented the following entity-resolution techniques. 

simjoin: This is the machine-based technique used by our
hybrid human-machine workflow (see Section 7.1).

Figure 12: Comparing hybrid human-machine workflow
with existing machine-based techniques.

SVM: This is a state-of-the-art learning-based technique.
First, we computed a feature vector for each pair of records.
For the Restaurant dataset, we chose two similarity
functions, edit distance and cosine similarity, adopted by [18],
and computed their values on four attributes to obtain an
8-dimensional feature vector. For the Product dataset, we
chose the same two similarity functions and computed their
values on the Name attribute to obtain a 2-dimensional
feature vector. Next we trained a classifier on 500 pairs that
were randomly selected from the pairs whose Jaccard
similarities were above 0.1. (Note that the training pairs were
sampled 10 times, and we report the average performance
here.) Finally, SVM returned a ranked list of the remaining
pairs sorted based on the likelihood given by the classifier [4].

hybrid: Our hybrid human-machine workflow first uses simjoin
to obtain a set of pairs based on a specified threshold, and
then verifies these pairs by using the cluster-based HITs
generated by the two-tiered approach with k = 10. For
the Restaurant dataset, simjoin returned 2004 pairs (102
matching pairs, 96.2% recall) based on the specified
threshold 0.35, and the two-tiered approach generated 112
cluster-based HITs. On the Product dataset, simjoin returned
8315 pairs (1,011 matching pairs, 92.2% recall) based on the
specified threshold 0.2, and the two-tiered approach generated
508 cluster-based HITs. We posted these HITs on AMT
and replicated each HIT into three assignments. Thus, we
spent 112 × 3 × $0.025 = $8.40 on the Restaurant dataset, and
508 × 3 × $0.025 = $38.10 on the Product dataset. Finally, the
hybrid human-machine workflow returned a ranked list of
the pairs sorted based on the results of the crowd.
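The cost figures follow directly from the per-HIT prices given in Section 7.1 ($0.02 to the worker plus $0.005 to AMT per assignment). A small sketch with a hypothetical helper makes the arithmetic explicit, using `Decimal` to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal


def crowd_cost(num_hits, assignments=3,
               reward=Decimal("0.02"), fee=Decimal("0.005")):
    """Total cost in dollars of publishing `num_hits` HITs."""
    return num_hits * assignments * (reward + fee)


print(crowd_cost(112))  # Restaurant dataset: 8.400
print(crowd_cost(508))  # Product dataset: 38.100
```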

One detail we need to mention is the way we combined the 
answers from the three different assignments for each HIT. 
A simple technique would be to average the three responses 
for each HIT, but this approach is susceptible to spammers. 
Instead we adopted the EM-based algorithm [9], which has 
been shown to be effective in previous work [16,19]. 

We first compare hybrid with simjoin and SVM on the
Restaurant and Product datasets. Figure 12 shows the
results. In the figure, hybrid and hybrid(QT) respectively
denote the hybrid workflow without and with a qualification
test. For the Restaurant dataset, we can see that hybrid and
hybrid(QT) achieve the same quality as SVM. This indicates
that the hybrid human-machine workflow based on a simple
non-learning based technique (i.e. simjoin) can have
comparable performance to a sophisticated learning based
technique (i.e. SVM). On the Product dataset, we can see that
hybrid and hybrid(QT) achieved significantly better quality
than simjoin and SVM. This indicates that for datasets on
which the machine-based techniques were unable to perform
well, a hybrid human-machine workflow can still achieve
very high quality.



Note that to further improve the recall of the hybrid work- 
flow, we can specify a smaller likelihood threshold thereby 
asking the crowd to perform more HITs. For example, in 
Figure 12(b), hybrid and hybrid(QT) can achieve at most 
92.2% recall. In contrast, as shown in Table 2(b), our hy- 
brid workflow, if used with a likelihood threshold of 0.1, 
could achieve up to 99.4% recall at the cost of crowdsourcing
37,641 pairs.

Next we compare hybrid with hybrid(QT). The results in
Figure 12 show that adding a qualification test can in fact
help to improve the result quality. There are two main
reasons for this. First, a qualification test can weed out
spammers since they are very likely to fail the test. Second,
a qualification test can force workers to read our
instructions more carefully. However, while the qualification
test can improve quality, this improvement may come at a
steep cost in terms of latency. For the Restaurant dataset,
hybrid and hybrid(QT) required 1.3 hours and 1.6 hours
respectively to complete 112 HITs; on the Product dataset,
hybrid and hybrid(QT) required 4.5 hours and 19.9 hours
respectively to complete 508 HITs.

7.4 Pair-based vs. Cluster-based HITs 

Having shown the benefits of hybrid Entity Resolution 
compared to both machine-based and human-based meth- 
ods, we now turn to examining the relative performance 
of pair-based vs. cluster-based HITs. As described in Sec- 
tion 6, the benefit of the cluster based approach in terms of 
number of comparisons depends on the number of matching 
pairs in the data set. Thus, in this comparison, in addi- 
tion to the Product data set used previously, we also use an 
additional dataset we created called Product+Dup that has 
more matching pairs than the datasets used in the previous 
experiments.

We created Product+Dup by randomly selecting 100
records from the Product dataset, and then adding x matching
records for each base record, where x is an integer random
variable uniformly distributed on [0, 9]. Matches were
generated by randomly swapping two tokens in the base
record. Product+Dup has 157,641 pairs of records, among
which 1713 pairs are matches.
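The duplicate-generation procedure can be sketched as follows. This is only an illustration of the described idea: the function names are hypothetical, records are simplified to a single name string, and a seeded `random.Random` instance is used so that runs are repeatable.

```python
import random


def swap_two_tokens(name, rng):
    """Return a copy of `name` with two randomly chosen tokens swapped."""
    tokens = name.split()
    if len(tokens) < 2:
        return name  # nothing to swap
    i, j = rng.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


def add_duplicates(base_records, rng, max_dups=9):
    """Follow each base record with x swapped copies, x uniform on [0, max_dups]."""
    out = []
    for record in base_records:
        out.append(record)
        for _ in range(rng.randint(0, max_dups)):
            out.append(swap_two_tokens(record, rng))
    return out


rng = random.Random(0)
base = ["apple 8gb black ipod touch", "sony bravia 40 inch lcd"]
for record in add_duplicates(base, rng):
    print(record)
```

Because a swap only reorders tokens, each generated record has the same token set as its base record, so a token-set similarity such as Jaccard would score such a pair at 1.0.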

We generated pair-based and cluster-based HITs using a
likelihood threshold of 0.2. We set the cluster size to k = 10,
which we denote by C10. In order to keep cost constant
across the two methods, we created pair-based HITs that
contained enough pairs so that the number of HITs generated
for both methods was the same. For the Product
dataset, there were 8315 pairs that needed to be crowdsourced,
resulting in 508 cluster-based HITs. In order to
generate the same number of pair-based HITs, we needed
to generate pair-based HITs containing 8315/508 ≈ 16 pairs on
average, denoted by P16. Similarly, for the Product+Dup
dataset, there were 3401 pairs that needed to be crowdsourced,
resulting in 120 cluster-based HITs, and 120 pair-based
HITs, containing 3401/120 ≈ 28 pairs on average, denoted
by P28.
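The batch sizes P16 and P28 come from dividing the number of crowdsourced pairs by the number of cluster-based HITs and rounding. A one-function sketch (the function name is hypothetical):

```python
def pairs_per_hit(num_pairs, num_hits):
    """Average pair-based batch size that yields `num_hits` HITs, rounded."""
    return round(num_pairs / num_hits)


print(pairs_per_hit(8315, 508))  # Product: 16
print(pairs_per_hit(3401, 120))  # Product+Dup: 28
```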

We first compare the median completion time per assignment
between pair-based HITs and cluster-based HITs. Figure 13
shows the results. QT represents the experimental
results with a qualification test. We can see that the time
to complete a single HIT was lower for the cluster-based
HITs than for the pair-based HITs in these experiments.
For the Product dataset, Figure 13(a), a cluster-based HIT
consumed about 15% less time than a pair-based HIT. The
difference was more dramatic for the Product+Dup dataset,
which contains more matching pairs.

Figure 13: Comparison of median completion time
per assignment between a pair-based HIT and a
cluster-based HIT.

Figure 14: Comparison of the time of completing all
pair-based HITs and all cluster-based HITs.

However, the results are somewhat different when considering
the total time taken to receive all of the results
(rather than a single HIT). These results are shown in
Figure 14. Surprisingly, for the Product dataset, pair-based
HITs were completed earlier. We found that the pair-based HITs
attracted more workers. This may be due to the unfamiliar
interface of cluster-based HITs, which makes workers feel that
they are harder to complete. Since we did these experiments
on AMT, we did not have specifically trained workers who
were familiar with the cluster-based interface. For the
Product+Dup dataset, however, the efficiency of cluster-based
HITs in the presence of more matches led to advantages in
overall completion time as well. Recall that pair-based HITs
containing on average 28 pairs were required to produce the
same number of HITs as the cluster-based approach, compared
to only 16 for the Product dataset. Since we kept
the price per HIT constant, fewer workers were attracted to
perform the pair-based HITs in this case.

Figure 15: Comparison of the quality of pair-based
HITs and cluster-based HITs.

Finally, Figure 15 shows that the result quality for pair- 
based HITs and cluster-based HITs was similar in these ex- 
periments. 



8. RELATED WORK 

Entity resolution is a critical task for data integration and 
cleaning. It has been studied extensively for several decades 
(see [11] for a survey). Some existing work has investigated 
how to benefit from human interaction. Sarawagi et al. [24] 
proposed an entity-resolution method using active learning, 
which allows a user to label a training set interactively. They 
have shown the method can significantly reduce the size of 
the training set needed to achieve high accuracy. Arasu et 
al. [1] proposed another active-learning approach which is 
more scalable to large datasets and is able to give proba- 
bilistic guarantees on the quality of the results. Jeffery et 
al. [17] studied user feedback in pay-as-you-go data integra- 
tion systems. In such systems, there exist some candidate 
duplicate pairs of elements (e.g. attribute names or values) 
that require user feedback for verification. They proposed a 
decision-theoretic approach to determine the order in which 
these pairs should be verified. McCann et al. [21] studied 
schema matching in online communities. They generated 
different types of questions to ask community members, and 
derived schema-matching results from the answers of these 
questions. 

Recently, crowdsourcing has attracted significant atten- 
tion in both the industrial and academic communities 
(see [10] for a recent survey). Recent projects in the 
database community aim to embed crowdsourcing into 
database query processing. Franklin et al. [12,13] extended
the relational database query language SQL to CrowdSQL by
enabling crowd-based operators, and built CrowdDB, a
relational query processing system based on CrowdSQL. Marcus
et al. [19] integrated SQL with crowd-based user defined
functions (UDFs), and proposed Qurk, a declarative workflow
management system. Parameswaran et al. [22] presented
Deco, a database system for declarative crowdsourcing.
In addition, there are many hybrid human-machine
systems being developed outside of the database community.
CrowdSearch [27] is an image search system that
combines automated image search with real-time human
validation of search results. Soylent [3] is a word processor
that utilizes crowd workers to shorten, proofread, and
edit documents. Although there are many studies in
crowdsourcing, to the best of our knowledge, no existing work
has explored how to improve entity resolution using hybrid
human-machine techniques combining a generic microtask
crowdsourcing platform with machine-based techniques.

There are also some studies on blocking which consider 
partitioning of a table of records to maximize matching 
record pairs co-occurring in given partitions [7]. Although 
our cluster-based HIT generation problem is a form of 
blocking, it differs from the typical blocking problem in two 
ways. First, our block size is constrained by what human 
workers can handle in a single task, whereas block size is 
typically determined by the point beyond which increasing 
it no longer reduces the complexity. Second, since the 
financial cost of human comparisons is driven by the number 
of tasks (i.e., blocks), our goal is to minimize the number of 
blocks of the given size, which is a different objective than 
that of previous work. 
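
To make this objective concrete, the following is a minimal sketch of a 
naive greedy baseline: it covers every candidate pair with blocks of at 
most k records and tries to keep the number of blocks small. It is not 
the two-tiered approach developed in this paper; the function name 
greedy_cluster_hits and the record identifiers are hypothetical and 
chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def greedy_cluster_hits(pairs: Iterable[Tuple[str, str]], k: int) -> List[List[str]]:
    """Cover every pair with a block (HIT) of at most k records.

    Naive greedy baseline for illustration only (hypothetical helper,
    not the paper's two-tiered method). Assumes each pair holds two
    distinct record identifiers.
    """
    if k < 2:
        raise ValueError("a block must hold at least two records")

    uncovered = {frozenset(p) for p in pairs}
    hits: List[List[str]] = []

    while uncovered:
        # Seed a new block with a deterministic choice of uncovered pair.
        cluster = set(min(uncovered, key=sorted))

        while len(cluster) < k:
            # For each outside record, count the uncovered pairs it would
            # complete with a record already in the block.
            gain = {}
            for p in uncovered:
                if len(p & cluster) == 1:
                    (outside,) = p - cluster
                    gain[outside] = gain.get(outside, 0) + 1
            if not gain:
                break  # no remaining pair touches this block
            cluster.add(max(sorted(gain), key=gain.get))

        hits.append(sorted(cluster))
        # A pair is covered once both of its records share a block.
        uncovered = {p for p in uncovered if not p <= cluster}

    return hits


example_pairs = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r4", "r5")]
example_hits = greedy_cluster_hits(example_pairs, k=3)
# example_hits == [["r1", "r2", "r3"], ["r4", "r5"]]
```

With k = 2 the same function degenerates to one block per pair, which 
corresponds to issuing each candidate pair as its own task. 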

9. CONCLUSION AND FUTURE WORK 

In this paper we have studied the problem of crowdsourc- 
ing entity resolution. We described how machine-only ap- 
proaches often fall short on quality, while brute-force people- 
only approaches are too slow and expensive. Thus, we pro- 
posed a hybrid human-machine workflow to address this 
problem. In the context of this hybrid approach, we in- 
vestigated pair-based and cluster-based HIT generation. In 
particular, we formulated the cluster-based HIT generation 
problem, and showed that it is NP-Hard. We then de- 
vised a heuristic two-tiered approach to solve this prob- 
lem. We described this method, and presented the results 
of extensive experiments on real data sets using the AMT 
platform. The experiments show that our hybrid approach 
achieves both good efficiency and high accuracy compared 
to machine-only or human-only alternatives. In particular, 
the results indicated that (1) the two-tiered approach gener- 
ated fewer cluster-based HITs than existing algorithms; (2) 
the hybrid human-machine workflow significantly reduced the 
number of HITs compared to human-based techniques, and 
achieved higher quality than the state-of-the-art machine- 
based techniques; and (3) the cluster-based HITs can pro- 
vide lower latency than a pair-based approach, particularly 
in the presence of many matching records, but that the sim- 
plicity of the pair-based interface seemed to be appealing to 
AMT workers. 

Our work represents an initial approach towards hybrid 
human-machine entity resolution. There are many further 
research directions to explore. First of all, we were surprised 
that some AMT workers preferred to do the relatively large 
pair-based tasks over the much smaller cluster-based tasks, 
even though the price paid for them was identical. We be- 
lieve that this could be due in part to workers' lack of famil- 
iarity with the cluster-based interface, which, if true, raises 
the possibility that different approaches could have very dif- 
ferent performance when applied to experienced vs. novice 
crowds. There are also many user interface improvements 
that could be made to both the pair-based and cluster-based 
interfaces, which could have dramatic effects on cost, quality 
and latency. 

Another issue to be addressed is that of scaling to much 
larger datasets. Our approach utilizes machine-based tech- 
niques to remove dissimilar pairs. However, for a data set 
with millions of records, there will still be a large number of 
remaining pairs that need to be verified by the crowd. Therefore, 
we need to explore how to make better use of machine- 
based techniques to further offload relatively expensive crowd 
resources. A related issue is the development of a budget- 
based approach to hybrid entity resolution. Users may wish 
to trade off cost, quality and latency, and the development 
of tools and algorithms to facilitate such tradeoffs is a deep 
research challenge. Finally, we would like to extend these 
techniques to take privacy into consideration. Sometimes, 
the data to be integrated is confidential, so other techniques 
will be required to allow crowds to process the data. 